
Collaborating Authors: human opinion


On the Alignment of Large Language Models with Global Human Opinion

Liu, Yang, Kaneko, Masahiro, Chu, Chenhui

arXiv.org Artificial Intelligence

Today's large language models (LLMs) support multilingual scenarios, allowing users to interact with them in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly examine the opinions LLMs represent for demographic groups in the United States or a handful of countries; they lack worldwide country samples, do not study human opinions across historical periods, and do not discuss using language itself to steer LLMs. They also overlook the potential influence of the prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across countries, languages, and historical periods around the world. We find that LLMs appropriately align or over-align with the opinions of only a few countries, while under-aligning with most. Furthermore, changing the prompt language to match the language used in the questionnaire steers LLMs toward the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/ku-nlp/global-opinion-alignment and https://github.com/nlply/global-opinion-alignment
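The abstract does not specify the alignment metric, but a common way to score such country-level alignment is one minus the Jensen-Shannon distance between the LLM's answer distribution and the survey's response distribution for the same question. The sketch below is a minimal illustration under that assumption; the option probabilities are hypothetical, not from the paper.

```python
from math import log2

def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions.
    Always in [0, 1]; 0 means the distributions are identical."""
    tp, tq = sum(p), sum(q)
    p = [x / tp for x in p]          # normalize to probabilities
    q = [x / tq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)
    return ((kl(p, m) + kl(q, m)) / 2) ** 0.5

def alignment_score(llm_probs, survey_probs):
    """1 - JS distance: 1.0 means the LLM exactly matches the survey."""
    return 1.0 - js_distance(llm_probs, survey_probs)

# Hypothetical 4-option WVS-style question: the LLM's answer distribution
# vs. the share of one country's respondents choosing each option.
llm = [0.6, 0.2, 0.1, 0.1]
country = [0.5, 0.3, 0.1, 0.1]
score = alignment_score(llm, country)
```

Averaging such scores over a country's questions gives one per-country alignment number, which can then be compared across prompt languages.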


Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings

Hong, Harbin, Caldas, Sebastian, Leqi, Liu

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model for simulating people's opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
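The abstract does not give the paper's exact test statistic, but a standard instantiation of such a test in multiple-choice settings is a chi-square goodness-of-fit test: treat the LLM-simulated option probabilities as the null distribution and ask whether the observed human choice counts are consistent with it. The counts and probabilities below are hypothetical.

```python
def chi2_misalignment(human_counts, llm_probs):
    """Chi-square goodness-of-fit statistic testing whether observed human
    choice counts could plausibly come from the LLM-simulated distribution."""
    n = sum(human_counts)
    expected = [n * p for p in llm_probs]
    return sum((o - e) ** 2 / e for o, e in zip(human_counts, expected))

# Hypothetical 4-option survey question with 200 human respondents.
human = [90, 60, 30, 20]
llm = [0.25, 0.25, 0.25, 0.25]   # LLM simulates a uniform choice distribution
stat = chi2_misalignment(human, llm)
# With k = 4 options, df = k - 1 = 3; the 5% critical value is 7.815.
rejected = stat > 7.815          # True -> reject "LLM matches humans"
```

Rejecting the null at a chosen significance level is then evidence that the model is ill-suited to simulate that sub-population on that question.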


Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases

Xu, Shanshan, Santosh, T. Y. S. S, Elazar, Yanai, Vogel, Quirin, Plank, Barbara, Grabmair, Matthias

arXiv.org Artificial Intelligence

The increased adoption of Large Language Models (LLMs) and their potential to shape public opinion have sparked interest in assessing these models' political leanings. Building on previous research that compared LLMs and human opinions and observed political bias in system responses, we take a step further to investigate the underlying causes of such biases by empirically examining how the values and biases embedded in training corpora shape model outputs. Specifically, we propose a method to quantitatively evaluate the political leanings embedded in large pretraining corpora. We then investigate with which the LLMs' political leanings are more aligned: their pretraining corpora or the surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 U.S. Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data and the need for robust evaluation metrics to ensure LLMs' alignment with human-centered values.


Rethinking STS and NLI in Large Language Models

Wang, Yuxia, Wang, Minghan, Nakov, Preslav

arXiv.org Artificial Intelligence

Recent years have seen the rise of large language models (LLMs), where practitioners use task-specific prompts; this was shown to be effective for a variety of tasks. However, when applied to semantic textual similarity (STS) and natural language inference (NLI), the effectiveness of LLMs turns out to be limited by low accuracy in low-resource domains, model overconfidence, and difficulty in capturing the disagreements between human judgements. With this in mind, here we try to rethink STS and NLI in the era of LLMs. We first evaluate STS and NLI performance in the clinical/biomedical domain, and then we assess LLMs' predictive confidence and their capability to capture collective human opinions. We find that these old problems still need to be properly addressed in the era of LLMs.


What is supervised machine learning?

#artificialintelligence

The training process for artificial intelligence (AI) algorithms is largely automated by design. There are often thousands, millions, or even billions of data points, and the algorithms must process all of them in search of patterns. In some cases, though, AI scientists are finding that the algorithms can be made more accurate and efficient if humans are consulted, at least occasionally, during training.


What Can We Learn from Collective Human Opinions on Natural Language Inference Data?

Nie, Yixin, Zhou, Xiang, Bansal, Mohit

arXiv.org Artificial Intelligence

Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on using the majority label with presumably high agreement as the ground truth. Less attention has been paid to the distribution of human opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI. Analysis reveals that: (1) high human disagreement exists in a noticeable amount of examples in these datasets; (2) the state-of-the-art models lack the ability to recover the distribution over human labels; (3) models achieve near-perfect accuracy on the subset of data with a high level of human agreement, whereas they can barely beat a random guess on the data with low levels of human agreement, which compose most of the common errors made by state-of-the-art models on the evaluation sets. This questions the validity of improving model performance on old metrics for the low-agreement part of evaluation datasets. Hence, we argue for a detailed examination of human agreement in future data collection efforts, and evaluating model outputs against the distribution over collective human opinions. The ChaosNLI dataset and experimental scripts are available at https://github.com/easonnie/ChaosNLI
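One concrete way to evaluate model outputs against the distribution over collective human opinions, as the abstract argues for, is a divergence between the model's softmax over NLI labels and the annotator label distribution; KL divergence is one standard choice (the specific metric and the numbers below are illustrative assumptions, not taken from the paper).

```python
from math import log2

def kl_divergence(human, model, eps=1e-12):
    """KL(human || model) over the NLI labels (entailment, neutral,
    contradiction). Lower is better; 0 means the model exactly recovers
    the human label distribution. eps guards against log of zero."""
    return sum(h * log2(h / max(m, eps)) for h, m in zip(human, model) if h > 0)

# ChaosNLI-style item: 100 annotators split 55/40/5 across the three labels.
human = [0.55, 0.40, 0.05]
confident = [0.98, 0.01, 0.01]    # hypothetical near-one-hot softmax
calibrated = [0.50, 0.45, 0.05]   # hypothetical softmax tracking disagreement
worse = kl_divergence(human, confident)
better = kl_divergence(human, calibrated)
```

Under such a metric, a model that predicts the majority label with near-certainty scores poorly on high-disagreement items even when its top-1 accuracy is unchanged.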


When beauty is in the eye of the (robo)beholder

#artificialintelligence

For over a year, I worked as a beauty editor, researching and writing about the products, trends, and people that make us want to look a certain way. As research for many of the stories I wrote, I consulted with dermatologists, plastic surgeons, makeup artists, aestheticians, and more, trying to answer a simple question: how can I make myself more conventionally attractive? "Beauty is confidence," they'd always say, prefacing the real answer. Inevitably, these experts would eventually tell me that you feel more confident, and thus more beautiful, when you look blemish- and wrinkle-free. Naturally, the problem here is the premise.


Beauty.AI 2.0 Winners

#artificialintelligence

The second beauty contest, in which humans are judged by robots, has concluded, with over six thousand images evaluated by five robot judges. In addition to the panel of judges from the first contest, Beauty.AI 2.0 featured three new robot judges: "Average Face," built on the hypothesis that the closer a face is to the average face within its ethnic group, the more attractive it is; "AntiAgeist," evaluating the difference between predicted and actual chronological age; and "PIMPL," evaluating the number and distribution of pimples and other dark spots (but not freckles). The results were sent to individual participants via secure link, and the winners were announced at http://winners2.beauty.ai/#win . The results were surprising, since the consensus scores produced by the robot jury disagreed with human opinion. Dozens of participants responded with angry emails criticizing the winners selected by the robot jury. Statements such as "What is your 'robot' worth? One walk through a shopping mall and I will discover more attractive people than the ones who 'won' your beauty contest," "If this is how I will be judged in the future, I don't want to see it," and "You need human opinion" were among the most pleasant ones, with rare positive comments including "This contest is a confidence booster!"